![]() |
The Twofish Encryption Algorithm: A 128-Bit Block Cipher
by Bruce Schneier ; John Kelsey ; Doug Whiting ; David Wagner ; Chris Hall ; Niels Ferguson Wiley Computer Publishing, John Wiley & Sons, Inc. ISBN: 0471353817 Pub Date: 03/01/99 |
Previous | Table of Contents | Next |
Empirically, there also seems to be some anomalous behavior of the C compilers. In almost all cases, the encryption and decryption routines in C achieved speeds within a few percent of each other. However, there were cases in the table where the two speeds differed by considerably more than ten percent (we used the larger number in the table), which is very odd because the inner loop C code used is virtually identical. We also noted several cases where compiler switches that seemed unrelated to performance optimization sometimes caused very large changes in timings.
It should be noted that performance numbers for Pentium II processors are almost identical in all cases to those for a Pentium Pro, which is not surprising since Intel claims they have the same core. The Pentium and Pentium MMX achieve almost identical speeds for encryption and decryption, although, as noted above, the MMX key setup times are faster, due mainly to the larger cache size.
The bottom line is that, when comparing the relative performance of different algorithms, using the same language and compiler for all implementations helps to make the comparison meaningful, but it does not guarantee a valid measure of the relative speeds. We have listed many different software performance metrics across platforms and languages to facilitate speed comparisons between Twofish and other algorithms. Our belief is that, on any given platform (e.g., Pentium Pro), the assembly-language performance numbers are the best numbers to use to gauge absolute performance, since they are unaffected by the vagaries and limitations of the compiler (e.g., inability to produce rotate opcodes). High-level languages (e.g., C, Java) are also important because of the ease of porting to different platforms, but once an algorithm becomes standardized, it will ultimately be coded in assembly for the most popular platforms.
RAM, ROM, or EEPROM for Key | Working RAM | Code and Table Size | Clocks per Block | Time per Block @ 4MHz |
---|
24 | 36 | 2200 | 26500 | 6.6 msec |
24 | 36 | 2150 | 32900 | 8.2 msec |
24 | 36 | 2000 | 35000 | 8.7 msec |
24 | 36 | 1750 | 37100 | 9.3 msec |
184 | 36 | 1900 | 15300 | 3.8 msec |
184 | 36 | 1700 | 18100 | 4.5 msec |
184 | 36 | 1450 | 19200 | 4.8 msec |
1208 | 36 | 1300 | 12700 | 3.2 msec |
1208 | 36 | 1100 | 15500 | 3.9 msec |
1208 | 36 | 850 | 16600 | 4.2 msec |
3256 | 36 | 1000 | 11900 | 3.0 msec |
Table 5.4. Twofish Performance on a 6805 Smart Card |
Twofish is ideally suited for smart cards. It can fit on the smallest smart cards while exhibiting reasonable performance, and can take advantage of more RAM or more powerful processors with increased performance. It can operate efficiently in environments that require rapid key changes, and can be implemented in dedicated hardware in only a few gates.
We implemented Twofish on a 6805 CPU, which is a typical smart card processor, with several different spacetime tradeoff options [SKW+99b]. Our results are shown in Table 5.4. The code size includes both encryption and decryption.1 The block encryption and decryption times are almost identical. If only encryption is required, minor improvements in code size and speed can be obtained. The only key schedule precomputation time required in this implementation is the Reed-Solomon mapping used to generate the S-box key material S from the key M, which requires slightly over 1750 clocks per key. This setup time can be made considerably shorter at the cost of two additional 256-byte ROM tables. It should also be observed that the lack of a second index register on the 6805 has a significant impact on the code size and performance, so a different CPU with multiple index registers (e.g., 6502) might be a better fit for Twofish.
1For comparison purposes: DES on a 6805 takes about 1K code, 23 bytes of RAM, and 20000 clock cycles per block.
For any encryption algorithm, memory usage can be divided into two parts: that required to hold the expanded key, and that required as working space to encrypt or decrypt text (including the text block). In applications where a smart card holds a single key for a long period of time, the key can be put into EEPROM or even ROM, greatly reducing RAM requirements. Most applications, however, require the smart card to encrypt using session keys, which change with each transaction. In these situations, the expanded key must be stored in RAM, along with working space to perform the encryption.
Twofishthe 128-bit key versioncan be implemented in a smart card in 60 bytes of RAM. This includes the text block, key, and working space. If a slightly expanded key (16 bytes of the key plus another eight bytes of the Reed-Solomon results (S)) can be stored in ROM or EEPROM, then Twofish can be implemented in only 36 bytes of RAM. In either case, there is zero key setup time for the next encryption operation with the same key.2
2All of our implementations leave the key intact so that it can be used again.
Previous | Table of Contents | Next |